
add_exllamav2 #1419

Merged: 8 commits, Oct 24, 2023
Conversation

@SunMarc (Member) commented on Sep 27, 2023

What does this PR do?

This PR adds the possibility to choose the exllamav2 kernels for GPTQ models. It follows the integration of those kernels in auto-gptq. I've also added a test to check that we are able to load and run inference with the exllamav2 kernel. I will update the benchmark in a follow-up PR.

• Merge after the release of auto-gptq?
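A minimal usage sketch of what this enables, assuming optimum's load_quantized_model entry point and a disable_exllamav2-style flag as discussed in this thread; the exact keyword and the checkpoint path below are assumptions, not code taken from the PR:

```python
# Usage sketch only: the flag name (disable_exllamav2) and checkpoint path are assumptions.
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
from optimum.gptq import load_quantized_model

save_folder = "path/to/gptq-quantized-model"  # hypothetical GPTQ checkpoint

# Build an empty model skeleton, then load the quantized weights with exllamav2 kernels.
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(
        AutoConfig.from_pretrained(save_folder), torch_dtype=torch.float16
    )
empty_model.tie_weights()

model = load_quantized_model(
    empty_model,
    save_folder=save_folder,
    device_map="auto",
    disable_exllamav2=False,  # assumed keyword: enable the exllamav2 kernel
)
```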

@SunMarc mentioned this pull request on Sep 27, 2023
Review threads (resolved) on:
• docs/source/llm_quantization/usage_guides/quantization.mdx
• optimum/gptq/quantizer.py
• tests/gptq/test_quantization.py
Comment on lines +235 to +241
def test_generate_quality(self):
# don't need to test
pass

def test_serialization(self):
# don't need to test
pass
Contributor commented:
why not?

Member Author (SunMarc) replied:
Here, we quantize the model with the cuda-old kernel and save it so that it can later be loaded with exllamav2 in the test_exllama_serialization test. Since these two tests would only exercise the cuda-old kernel, there is no need to run them again: they are already covered by a previous test.

Contributor:
And how about generate_quality?

Member Author (SunMarc):
This is also covered in the GPTQTest class. The wording is confusing, but test_exllama_serialization in GPTQTestExllamav2 does two things: it tests loading the quantized weights with the exllamav2 kernels, and it checks inference correctness.
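For anyone skimming the thread, a condensed sketch of the flow SunMarc describes (quantize and save with the cuda-old kernel, then reload with exllamav2 and check generation afterwards); the helper and flag names are assumptions, not the literal test code:

```python
# Hedged sketch of the serialization round-trip described above; not the literal test.
import tempfile
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM
from optimum.gptq import GPTQQuantizer, load_quantized_model

def exllamav2_roundtrip(quantized_model, quantizer: GPTQQuantizer, model_name: str):
    """Save a model quantized with the cuda-old kernel, then reload it with exllamav2."""
    with tempfile.TemporaryDirectory() as tmpdirname:
        quantizer.save(quantized_model, tmpdirname)
        with init_empty_weights():
            empty_model = AutoModelForCausalLM.from_config(
                AutoConfig.from_pretrained(model_name), torch_dtype=torch.float16
            )
        empty_model.tie_weights()
        # Reload with the exllamav2 kernel; inference correctness is checked on the result.
        return load_quantized_model(
            empty_model,
            save_folder=tmpdirname,
            device_map="auto",
            disable_exllamav2=False,  # assumed keyword
        )
```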

@SunMarc SunMarc requested a review from fxmarty October 16, 2023 18:20
@fxmarty (Contributor) left a review:
LGTM, that's great! As somebody else commented on the transformers PR, it's true that the proliferation of disable_* arguments is not very scalable; that was probably a bad idea on my end to use that in AutoGPTQ.
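For context on the disable-args point above, one way to avoid a separate disable_* flag per kernel generation is a single config dict that selects the kernel version. A hedged sketch of that pattern on the transformers side (the keyword and accepted keys are assumptions):

```python
# Hedged sketch of a consolidated alternative: one config dict instead of
# per-kernel disable_* flags. Keyword and accepted keys are assumptions.
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(bits=4, exllama_config={"version": 2})
model = AutoModelForCausalLM.from_pretrained(
    "path/to/gptq-quantized-model",  # hypothetical checkpoint
    device_map="auto",
    quantization_config=gptq_config,
)
```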

@SunMarc merged commit aba7f46 into huggingface:main on Oct 24, 2023
50 of 52 checks passed
@SunMarc deleted the add_exllamav2 branch on Oct 24, 2023 at 17:58
@SunMarc restored the add_exllamav2 branch on Oct 24, 2023 at 18:49
SunMarc added a commit that referenced this pull request on Oct 24, 2023
@achew010 commented on Apr 4, 2024

Hi @SunMarc / @fxmarty,

I was running some QPeft experiments and it looks like optimum's GPTQ interface does not work with the exllama kernel for fine-tuning: the loss diverges compared to using the default cuda kernel, see the figure below.

[Figure: training loss curves, exllama kernel (diverging) vs. default cuda kernel]

I suspect this is likely due to the absence of a backward function in the exllama kernel. I'm wondering if you are aware of this behaviour?
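A hedged workaround sketch for the issue achew010 reports: keep the exllama kernels disabled while fine-tuning, since they only provide a forward pass, and fall back to the cuda kernel instead. The flag name and checkpoint path are assumptions and may differ across transformers/optimum versions:

```python
# Hedged workaround sketch: disable exllama kernels for training (flag name assumed).
from transformers import AutoModelForCausalLM, GPTQConfig

gptq_config = GPTQConfig(bits=4, use_exllama=False)  # fall back to the cuda kernel
model = AutoModelForCausalLM.from_pretrained(
    "path/to/gptq-quantized-model",  # hypothetical checkpoint
    device_map="auto",
    quantization_config=gptq_config,
)
# ...attach PEFT adapters and fine-tune as usual.
```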
